Distributed training #8
base: main
Conversation
LGTM - @matthewdeng to take a look

@matthewdeng sorry meant to request you. please review and MERGE!!!
To run:
```bash
python pytorch.py
```
Suggest having a slightly more descriptive name here, e.g. `train_torch_model.py`.
### Monitor
After launching the script, you can look at the Ray dashboard. It can be accessed from the Workspace home page and enables users to track things like CPU/GPU utilization, GPU memory usage, remote task statuses, and more!

![Dash](https://github.com/anyscale/templates/releases/download/media/workspacedash.png)
This image is highlighting VSCode 😅
[See here for more extensive documentation on the dashboard.](https://docs.ray.io/en/latest/ray-observability/getting-started.html)
### Model Saving
The model will be saved in the Anyscale Artifact Store, which is automatically set up and configured with your Anyscale deployment.
Can we point to Anyscale documentation here? I feel like this is introducing a new concept for something that should be simpler (a cloud storage bucket).
```bash
gsutil ls $ANYSCALE_ARTIFACT_STORAGE
```
Authentication is automatcially handled by default.
Suggested change: "Authentication is automatcially handled by default." → "Authentication is automatically handled by default."
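For context, since `$ANYSCALE_ARTIFACT_STORAGE` is just a URI pointing at a cloud storage bucket, the trained weights can be uploaded with the same `gsutil` tool shown above. A minimal sketch (the stand-in model, the `models/model.pt` key, and the use of `subprocess` to call `gsutil cp` are assumptions for illustration, not part of the template):

```python
import os
import subprocess

import torch
import torch.nn as nn

model = nn.Linear(10, 1)  # stand-in for the trained model from the script

local_path = "/tmp/model.pt"
torch.save(model.state_dict(), local_path)

# ANYSCALE_ARTIFACT_STORAGE is set automatically inside Anyscale Workspaces.
dest = os.environ["ANYSCALE_ARTIFACT_STORAGE"].rstrip("/") + "/models/model.pt"
subprocess.run(["gsutil", "cp", local_path, dest], check=True)
```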
### Submit as Anyscale Production Job
From within your Anyscale Workspace, you can run your script as an Anyscale Job. This might be useful if you want to run things in production and have a long-running job. Each Anyscale Job will spin up its own cluster (with the same compute config and cluster environment as the Workspace) and run the script. The Anyscale Job will automatically retry in the event of failure and provides monitoring via the Ray Dashboard and Grafana.

To submit as a Production Job you can run:
Could we have consistency in the naming? We use "Anyscale Production Job", "Anyscale Job", and "Production Job" here - it may not be obvious to the user that all three of these combinations are meant to be the same thing 😄
for _ in range(epochs):
    train_epoch(train_dataloader, model, loss_fn, optimizer)
    loss = validate_epoch(test_dataloader, model, loss_fn)
    session.report(dict(loss=loss))
We should save a checkpoint here 😄
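A rough sketch of what that could look like, reusing the names from the snippet above and assuming the Ray AIR `session`/`Checkpoint` API that the `session.report` call suggests (not necessarily the exact version this template targets):

```python
from ray.air import session
from ray.air.checkpoint import Checkpoint

for epoch in range(epochs):
    train_epoch(train_dataloader, model, loss_fn, optimizer)
    loss = validate_epoch(test_dataloader, model, loss_fn)
    # Attach the weights to the report so Ray Train persists a checkpoint each epoch.
    session.report(
        dict(loss=loss),
        checkpoint=Checkpoint.from_dict(
            {"epoch": epoch, "model_state_dict": model.state_dict()}
        ),
    )
```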
parser.add_argument(
    "--smoke-test",
    action="store_true",
    default=False,
    help="Finish quickly for testing.",
)
This isn't used.
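For example, it could be wired up roughly like this (a sketch; treating `epochs` and `num_workers` as the knobs to shrink is an assumption about the script):

```python
args = parser.parse_args()

if args.smoke_test:
    # Hypothetical: cut the run down so a test pass finishes quickly.
    epochs = 1
    num_workers = 1
```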
min_workers: 1
max_workers: 3
I don't think these values really make sense with the current training script. If we are showing distributed training with 2 GPUs, I think we should either have `min_workers` be 2 (to make the script run immediately) or 1 (if we want to show autoscaling).
parser.add_argument(
    "--use-gpu", action="store_true", default=True, help="Enables GPU training"
)
I think this makes it so that this value is always true?
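Yes, with `action="store_true"` and `default=True` the parsed value is `True` whether or not the flag is passed. One possible fix is to default to `False` so that GPU training is opt-in (a sketch):

```python
import argparse

parser = argparse.ArgumentParser()
# store_true leaves the value False unless --use-gpu is passed explicitly.
parser.add_argument(
    "--use-gpu",
    action="store_true",
    default=False,
    help="Enables GPU training",
)
args = parser.parse_args()
```

Alternatively, `argparse.BooleanOptionalAction` (Python 3.9+) keeps `True` as the default while also generating a `--no-use-gpu` flag to turn it off.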